This script and corresponding output use simulated data based on the original dataset I downloaded from the Alzheimer’s Disease Neuroimaging Initiative. Part of the data agreement terms for ADNI users is that data not be published externally, so I cannot directly share the data upon which I ran my own analysis. However, the simulated data structure (columns, number of rows, subject IDs, etc.) is identical to that downloaded from ADNI; the only differences are simulated tau-PET uptake values, age, sex, and cognitive assessment scores. Any reader who wishes to access the actual dataset used for my analysis should register for an ADNI account (free) and refer to the specific data files described in Data Understanding.
Additionally, this project has been published as a GitHub pages website containing figures and analysis created with the actual ADNI dataset, and can be found here: https://anniegbryant.github.io/DA5030_Final_Project/
Lastly, the Shiny app deployed based on this project can be found here: https://annie-bryant.shinyapps.io/TauPET_Shiny_App_Notebook/
For my final project for DA5030 “Data Mining and Machine Learning”, my objective is to leverage neuroimaging-based data to predict cognitive decline in subjects along the cognitive spectrum from cognitively unimpaired to severe dementia. The goal is to identify specific brain regions that, when burdened by Alzheimer’s Disease-related pathology, confer predictive power for cognitive status, measured via neuropsychological assessment. Ideally, I would like to identify the regions of interest (ROIs) in the brain that change the most with decreasing cognitive ability and to refine a set of ROIs that collectively predict changes to cognitive assessment scores. This will be (tentatively) regarded as a success if one or more ROIs can explain more than 50% of the variance in cognitive assessment scores (i.e. \(R^2 > 0.5\)).
I will focus on one specific form of neuroimaging: Positron Emission Tomography (PET). PET imaging enables the visualization of specific molecular substrates in the brain through the use of radioactively labeled tracers that bind the target substrate. In this case, I have chosen to focus on PET with a tracer that binds to the protein tau, which exhibits characteristic misfolding in Alzheimer’s Disease (AD). Misfolded tau not only loses its normal function, but it also aggregates into intracellular neurofibrillary tangles (NFTs) that can disrupt neuronal signaling and promote neurodegeneration. This phenomenon typically follows an archetypical spreading pattern beginning in the entorhinal cortex, progressing out to the hippocampus and amygdala, and then spreading beyond the medial temporal lobe to the limbic system and on to the neocortex. This staging pattern is well-defined following the seminal paper published by Braak & Braak in 1991; the stages of tau NFT pathology progression are now known as the Braak stages. There are six stages of tau NFT progression in total.
Such staging has traditionally only been possible at autopsy, as it requires careful immunohistochemical staining of several brain regions by an experienced neuropathologist. However, recent years have seen the development of tau-PET tracers that are specific to misfolded NFT tau. One tracer in particular, 18F-AV-1451, has become widely used in the last few years as a non-invasive biomarker to measure regional accumulation of tau in the human brain. Tau-PET uptake correlates well with the typical postmortem Braak staging patterns (Schwarz et al. 2016) as well as cognitive status (Zhao et al. 2019). Recent studies have utilized machine learning algorithms with tau-PET neuroimaging, as well as other (relatively) non-invasive biomarkers including amyloid-beta PET and cerebrospinal fluid (CSF) protein measurements, to collectively predict onset of dementia (Mishra et al. 2017) or to predict the spread of tau NFT pathology in the brain (Vogel et al. 2019, 2020). However, longitudinal analysis of tau-PET accumulation and its relationship to cognition remains relatively unexplored, largely because tau-PET tracers were developed only recently.
Through my role as a research assistant at the MassGeneral Institute for Neurodegenerative Disease, I have worked with the Alzheimer’s Disease Neuroimaging Initiative (ADNI) data repository previously. ADNI is a tremendous resource for imaging-based and molecular biomarker data acquired from thousands of research participants across the country (see Acknowledgments for more information). In 2016, ADNI incorporated 18F-AV-1451 tau-PET neuroimaging into its imaging protocol, and has since amassed well over a thousand tau-PET scans. Researchers at UCSF have processed many of these images and quantified regional uptake of the tau-PET tracer, and have generously shared their regional tau-PET data for ADNI collaborators to access. ADNI has also compiled cognitive assessment scores for each subject. I will utilize these two resources to develop individual regression models as well as an ensemble model to predict cognitive decline as a function of pathological tau NFT accumulation throughout the brain.
The only constraint is that I cannot directly share the full dataset as downloaded from ADNI, though I encourage anyone interested in gaining access to register for free at http://adni.loni.usc.edu/. Instead, I will use the R library fakeR (vignette) to simulate the two datasets I will access from ADNI to publish in my GitHub repository, so that the interested reader can follow along with consistent data structures.
My goal in this analysis is to develop a model that can predict change in cognitive status through some combination (linear or nonlinear) of multiple brain regions, each of which exhibits a different change in tau-PET uptake. In doing this, I also hope to identify which region(s) of the brain are most prone to accumulation of tau NFT pathology as measured via PET, and in turn, which region(s) can best predict cognitive decline.
The target feature in this project will be a continuous score on a cognitive assessment (CDR Sum of Boxes; see Data Understanding). Therefore, models will be evaluated based on their root mean squared error (RMSE) and the \(R^2\) between predicted and observed cognitive scores. I have set a benchmark of success at \(R^2 > 0.5\), meaning the model explains more than 50% of the variance in cognitive score changes. This is an ambitious threshold, as cognitive status is multifactorial and certainly modulated by more than regional tau accumulation, but this figure will distinguish stronger from weaker predictive models.
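To make the two evaluation metrics concrete, here is a minimal sketch in base R. The score vectors are hypothetical numbers invented for illustration, not ADNI data, and \(R^2\) is taken here as the squared Pearson correlation between predicted and observed values, which is one common definition (an assumption on my part; the 1 - SSE/SST definition can give a different number).

```r
# Hypothetical observed and predicted CDR Sum of Boxes scores (illustrative only)
observed  <- c(0.0, 0.5, 1.0, 2.5, 4.0, 6.0)
predicted <- c(0.2, 0.4, 1.5, 2.0, 3.5, 5.0)

# Root mean squared error: typical size of a prediction error, in score units
rmse <- sqrt(mean((observed - predicted)^2))

# R^2 as the squared Pearson correlation between predicted and observed scores
r_squared <- cor(observed, predicted)^2

# Compare against the project benchmark of R^2 > 0.5
meets_benchmark <- r_squared > 0.5

print(c(RMSE = rmse, R2 = r_squared))
print(meets_benchmark)
```

RMSE is reported in the same units as the cognitive score, while \(R^2\) is unitless, so the two are complementary: a model can track the ordering of subjects well (high \(R^2\)) and still be off by a consistent amount (high RMSE).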
# General data wrangling
library(tidyverse)
## -- Attaching packages -------------------------------------------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.3 v dplyr 1.0.1
## v tidyr 1.1.1 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## -- Conflicts ----------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(knitr)
library(kableExtra)
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
library(DT)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(readxl)
library(fakeR)
# Modeling
library(factoextra)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
library(FactoMineR)
library(glmnet)
## Loading required package: Matrix
##
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
##
## expand, pack, unpack
## Loaded glmnet 4.0-2
library(caret)
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
library(ranger)
library(caretEnsemble)
##
## Attaching package: 'caretEnsemble'
## The following object is masked from 'package:ggplot2':
##
## autoplot
library(Hmisc)
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: Formula
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
# Visualization
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:Hmisc':
##
## subplot
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
library(forcats)
library(ggsignif)
library(ggcorrplot)
library(psych)
##
## Attaching package: 'psych'
## The following object is masked from 'package:Hmisc':
##
## describe
## The following objects are masked from 'package:ggplot2':
##
## %+%, alpha
library(GGally)
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
library(colorRamps)
library(RColorBrewer)
library(colorspace)
library(NeuralNetTools)
library(ggplotify)
library(igraph)
##
## Attaching package: 'igraph'
## The following object is masked from 'package:plotly':
##
## groups
## The following objects are masked from 'package:lubridate':
##
## %--%, union
## The following objects are masked from 'package:dplyr':
##
## as_data_frame, groups, union
## The following objects are masked from 'package:purrr':
##
## compose, simplify
## The following object is masked from 'package:tidyr':
##
## crossing
## The following object is masked from 'package:tibble':
##
## as_data_frame
## The following objects are masked from 'package:stats':
##
## decompose, spectrum
## The following object is masked from 'package:base':
##
## union
# ggseg is used to visualize the brain
# remotes::install_github("LCBC-UiO/ggseg")
# If that doesn't work:
# download.file("https://github.com/LCBC-UiO/ggseg/archive/master.zip", "ggseg.zip")
# unzip("ggseg.zip")
# devtools::install_local("ggseg-master")
library(ggseg)
# remotes::install_github("LCBC-UiO/ggseg3d")
library(ggseg3d)
# remotes::install_github("LCBC-UiO/ggsegExtra")
library(ggsegExtra)
The longitudinal tau-PET dataset was downloaded as a CSV from the Alzheimer’s Disease Neuroimaging Initiative (ADNI) Study Data repository located at Study Data/Imaging/PET Image Analysis/UC Berkeley - AV1451 Analysis [ADNI2,3] (version: 5/12/2020). This CSV file contains 1,121 rows and 241 columns. Note: ADNI data is freely accessible to all registered users. Please see my Acknowledgments page for more information about ADNI and its contributors.
On my end, I load partial volume corrected regional tau-PET data, as downloaded from ADNI:
tau.df <- read.csv("../ADNI_Data/Raw_Data/UCBERKELEYAV1451_PVC_05_12_20.csv")
tau.df$EXAMDATE = as.Date(tau.df$EXAMDATE, format="%m/%d/%Y")
# update stamp is irrelevant, drop it
tau.df <- select(tau.df, -update_stamp)
However, since I can’t share the tau-PET data directly from ADNI, I’ve simulated this dataset using the fakeR library with the following code:
set.seed(129)
# use simulate_dataset from fakeR
# stealth.level=2 means each column is simulated independently
tau.sim <- simulate_dataset(tau.df, digits=5, stealth.level=2,
                            ignore=c("RID", "VISCODE", "VISCODE2", "EXAMDATE")) %>%
  # update_stamp was already dropped above; any_of() avoids an error if it is absent
  select(-any_of("update_stamp"))
write.csv(tau.sim, "Simulated_ADNI_TauPET.csv", row.names = FALSE)
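One consequence of simulating each column independently is worth spelling out: per-column summaries stay close to the original, but relationships between columns are lost. The sketch below illustrates this in base R with a toy data frame. It is not fakeR's algorithm; the helper `simulate_independently()` and the columns `roi_a` and `roi_b` are hypothetical names introduced only for this example.

```r
set.seed(129)

# Toy stand-in for two correlated regional measurements (hypothetical values)
n <- 500
roi_a <- rnorm(n, mean = 1.5, sd = 0.3)
roi_b <- roi_a + rnorm(n, mean = 0, sd = 0.1)
toy <- data.frame(roi_a = roi_a, roi_b = roi_b)

# Simulate each column on its own: draw from a normal distribution with that
# column's mean and standard deviation, ignoring every other column
simulate_independently <- function(df) {
  as.data.frame(lapply(df, function(x) rnorm(length(x), mean(x), sd(x))))
}
toy_sim <- simulate_independently(toy)

# Column means are roughly preserved...
print(rbind(original = colMeans(toy), simulated = colMeans(toy_sim)))

# ...but the correlation between the two columns is not
print(c(original  = cor(toy$roi_a, toy$roi_b),
        simulated = cor(toy_sim$roi_a, toy_sim$roi_b)))
```

This is why results obtained from the simulated files should be read as a demonstration of the workflow rather than as a reproduction of the findings from the real ADNI data.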
The simulated tau-PET dataset can be loaded as follows:
tau.df <- read.csv("Simulated_ADNI_TauPET.csv")
tau.df$EXAMDATE = as.Date(tau.df$EXAMDATE, format="%Y-%m-%d")
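The two `as.Date()` calls above use different format strings because the date layout changes once the data has been written back out with `write.csv()`. A small round trip in base R shows the effect; the date string and the temporary file are placeholders chosen for illustration.

```r
# Date string in the month/day/year layout used by the first as.Date() call
raw_date <- "02/02/2018"
parsed <- as.Date(raw_date, format = "%m/%d/%Y")

# write.csv() stores Date columns as text in year-month-day order
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(EXAMDATE = parsed), tmp, row.names = FALSE)
reloaded <- read.csv(tmp)
print(reloaded$EXAMDATE)

# Reading the file back therefore needs the "%Y-%m-%d" format
reparsed <- as.Date(reloaded$EXAMDATE, format = "%Y-%m-%d")
print(parsed == reparsed)
```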
Each row in the CSV represents one tau-PET scan (see str call below). Some subjects had repeated scans separated by approximately one year, while other subjects had only one scan. The first columns hold subject information: anonymized subject ID, visit code, and PET exam date. The other columns encode regional volume and tau-PET uptake. Specifically, there are 123 distinct cortical and subcortical regions of interest (ROIs), each of which has a volume field (in mm^3) and a tau-PET uptake field, called the Standardized Uptake Value Ratio (SUVR).
str(tau.df)
## 'data.frame': 1120 obs. of 164 variables:
## $ RID : int 21 31 31 56 56 56 59 69 69 69 ...
## $ VISCODE : chr "init" "init" "y1" "init" ...
## $ VISCODE2 : chr "m144" "m144" "m156" "m144" ...
## $ EXAMDATE : Date, format: "2018-02-02" "2018-04-24" ...
## $ INFERIOR_CEREBGM_SUVR : num 0.342 0.452 0.129 0.145 2.957 ...
## $ INFERIOR_CEREBGM_VOLUME : num 56383 65081 56900 62775 52594 ...
## $ HEMIWM_SUVR : num 0.954 0.966 1.004 1.081 1.119 ...
## $ HEMIWM_VOLUME : num 483990 403653 429393 317379 408757 ...
## $ BRAAK12_SUVR : num 2.09 1.46 1.54 1.93 1.35 ...
## $ BRAAK12_VOLUME : num 10119 8682 13232 10611 12208 ...
## $ BRAAK34_SUVR : num 1.64 2.72 1.93 1.47 1.8 ...
## $ BRAAK34_VOLUME : num 116705 97847 106122 122590 103573 ...
## $ BRAAK56_SUVR : num 2.39 1.99 1.84 1.65 2.38 ...
## $ BRAAK56_VOLUME : num 298076 339962 341138 334962 315329 ...
## $ BRAIN_STEM_SUVR : num 1.17 1.23 1.14 1.22 1.18 ...
## $ BRAIN_STEM_VOLUME : num 20725 23551 14970 22162 22223 ...
## $ LEFT_MIDDLEFR_SUVR : num 2.156 1.426 1.938 0.853 1.316 ...
## $ LEFT_MIDDLEFR_VOLUME : num 20625 22012 19429 16616 21708 ...
## $ LEFT_ORBITOFR_SUVR : num 2.02 1.65 2.55 1.52 2.15 ...
## $ LEFT_ORBITOFR_VOLUME : num 11753 12113 11649 12066 11627 ...
## $ LEFT_PARSFR_SUVR : num 1.87 2 1.89 1.71 2.23 ...
## $ LEFT_PARSFR_VOLUME : num 9990 10503 10211 10983 7135 ...
## $ LEFT_ACCUMBENS_AREA_SUVR : num 1.91 2.04 1.5 3.22 1.5 ...
## $ LEFT_ACCUMBENS_AREA_VOLUME : num 355 325 831 418 691 ...
## $ LEFT_AMYGDALA_SUVR : num 0.923 2.092 1.41 0.789 1.76 ...
## $ LEFT_AMYGDALA_VOLUME : num 987 1284 841 1708 1639 ...
## $ LEFT_CAUDATE_SUVR : num 1.36 1.35 1.99 1.31 1.16 ...
## $ LEFT_CAUDATE_VOLUME : num 4226 3156 3132 4798 3438 ...
## $ LEFT_HIPPOCAMPUS_SUVR : num 1.36 1.57 1.28 1.94 2.22 ...
## $ LEFT_HIPPOCAMPUS_VOLUME : num 2644 4958 2849 2426 2993 ...
## $ LEFT_PALLIDUM_SUVR : num 3.05 1.91 2.2 2.12 2.12 ...
## $ LEFT_PALLIDUM_VOLUME : num 1485 1354 1360 1446 1400 ...
## $ LEFT_PUTAMEN_SUVR : num 1.81 2.07 1.07 1.91 1.73 ...
## $ LEFT_PUTAMEN_VOLUME : num 6113 3982 6187 6479 4979 ...
## $ LEFT_THALAMUS_PROPER_SUVR : num 1.44 1.16 1.3 1.08 1.19 ...
## $ LEFT_THALAMUS_PROPER_VOLUME : num 4916 7297 6316 7172 6165 ...
## $ RIGHT_MIDDLEFR_SUVR : num 2.16 1.74 2.41 1.41 1.56 ...
## $ RIGHT_MIDDLEFR_VOLUME : num 23643 17915 21423 19796 22511 ...
## $ RIGHT_ORBITOFR_SUVR : num 2.45 2.53 1.69 2.17 1.98 ...
## $ RIGHT_ORBITOFR_VOLUME : num 12020 11800 12455 12198 11621 ...
## $ RIGHT_PARSFR_SUVR : num 2.3 1.71 2.12 2.41 1.44 ...
## $ RIGHT_PARSFR_VOLUME : num 7173 10629 10659 9281 8880 ...
## $ RIGHT_ACCUMBENS_AREA_SUVR : num 1.08 1.49 1.5 1.85 1.74 ...
## $ RIGHT_ACCUMBENS_AREA_VOLUME : num 452 424 375 550 566 ...
## $ RIGHT_AMYGDALA_SUVR : num 0.359 0.987 1.605 1.883 2.486 ...
## $ RIGHT_AMYGDALA_VOLUME : num 1462 1637 781 1317 1882 ...
## $ RIGHT_CAUDATE_SUVR : num 1.66 1.15 1.83 1.86 1.39 ...
## $ RIGHT_CAUDATE_VOLUME : num 4361 4264 3728 4369 4000 ...
## $ RIGHT_HIPPOCAMPUS_SUVR : num 1.48 1.99 1.06 1.43 2.05 ...
## $ RIGHT_HIPPOCAMPUS_VOLUME : num 3550 4934 3266 4490 3745 ...
## $ RIGHT_PALLIDUM_SUVR : num 1.93 2.09 1.72 2.18 2.22 ...
## $ RIGHT_PALLIDUM_VOLUME : num 1730 1335 1409 899 1502 ...
## $ RIGHT_PUTAMEN_SUVR : num 1.83 1.61 1.31 1.49 1.33 ...
## $ RIGHT_PUTAMEN_VOLUME : num 5996 6245 5846 3153 5146 ...
## $ RIGHT_THALAMUS_PROPER_SUVR : num 1.23 1.18 1.44 1.17 1.64 ...
## $ RIGHT_THALAMUS_PROPER_VOLUME : num 7278 5221 6663 7635 6595 ...
## $ CHOROID_SUVR : num 4.03 2.26 3.23 3.61 1.44 ...
## $ CHOROID_VOLUME : num 3250 3629 5913 4229 4444 ...
## $ CTX_LH_BANKSSTS_SUVR : num 1.42 1.9 1.14 2.9 3.7 ...
## $ CTX_LH_BANKSSTS_VOLUME : num 2397 1945 2031 2778 1494 ...
## $ CTX_LH_CAUDALANTERIORCINGULATE_SUVR : num 1.32 1.87 1.63 1.62 2.04 ...
## $ CTX_LH_CAUDALANTERIORCINGULATE_VOLUME : num 1056 1447 1922 2292 830 ...
## $ CTX_LH_CUNEUS_SUVR : num 1.87 1.4 1.33 2.85 2.03 ...
## $ CTX_LH_CUNEUS_VOLUME : num 3024 2512 2730 3196 2642 ...
## $ CTX_LH_ENTORHINAL_SUVR : num 1.48 0.737 2.628 1.373 2.87 ...
## $ CTX_LH_ENTORHINAL_VOLUME : num 2536 2632 725 2091 1424 ...
## $ CTX_LH_FUSIFORM_SUVR : num 1.74 2.19 1.41 3.37 2.63 ...
## $ CTX_LH_FUSIFORM_VOLUME : num 7062 8214 8871 7792 9423 ...
## $ CTX_LH_INFERIORPARIETAL_SUVR : num 1.504 2.807 2.974 2.104 0.895 ...
## $ CTX_LH_INFERIORPARIETAL_VOLUME : num 9906 11466 11637 13180 9685 ...
## $ CTX_LH_INFERIORTEMPORAL_SUVR : num 1.19 2.77 2.24 1.26 3.73 ...
## $ CTX_LH_INFERIORTEMPORAL_VOLUME : num 9919 9647 10694 7474 6432 ...
## $ CTX_LH_INSULA_SUVR : num 1.95 1.55 1.8 1.5 1.55 ...
## $ CTX_LH_INSULA_VOLUME : num 5793 7211 5311 7230 7132 ...
## $ CTX_LH_ISTHMUSCINGULATE_SUVR : num 2.32 2.54 1.66 2.36 1.05 ...
## $ CTX_LH_ISTHMUSCINGULATE_VOLUME : num 1907 2075 2882 2013 2545 ...
## $ CTX_LH_LATERALOCCIPITAL_SUVR : num 2.46 2.75 2.49 2.41 3.66 ...
## $ CTX_LH_LATERALOCCIPITAL_VOLUME : num 7550 10710 10421 7259 11461 ...
## $ CTX_LH_LINGUAL_SUVR : num 2.11 2.5 1.51 2.34 2.18 ...
## $ CTX_LH_LINGUAL_VOLUME : num 6765 4697 6816 6056 4720 ...
## $ CTX_LH_MIDDLETEMPORAL_SUVR : num 1.875 2.573 1.462 0.896 2.273 ...
## $ CTX_LH_MIDDLETEMPORAL_VOLUME : num 9657 10869 6444 9428 10033 ...
## $ CTX_LH_PARACENTRAL_SUVR : num 1.37 2.08 1.77 1.62 1.83 ...
## $ CTX_LH_PARACENTRAL_VOLUME : num 3004 2869 2835 2917 3642 ...
## $ CTX_LH_PARAHIPPOCAMPAL_SUVR : num 2 1.53 2.4 2.19 1.65 ...
## $ CTX_LH_PARAHIPPOCAMPAL_VOLUME : num 1704 3116 1873 1627 2168 ...
## $ CTX_LH_PERICALCARINE_SUVR : num 1.253 1.695 1.557 0.918 1.628 ...
## $ CTX_LH_PERICALCARINE_VOLUME : num 1878 1491 1933 1844 1980 ...
## $ CTX_LH_POSTCENTRAL_SUVR : num 2.12 1.57 2.08 1.9 2.09 ...
## $ CTX_LH_POSTCENTRAL_VOLUME : num 8950 6142 9305 8798 8728 ...
## $ CTX_LH_POSTERIORCINGULATE_SUVR : num 1.64 1.1 1.32 1.75 1.92 ...
## $ CTX_LH_POSTERIORCINGULATE_VOLUME : num 3152 2631 2581 1706 3536 ...
## $ CTX_LH_PRECENTRAL_SUVR : num 1.77 2.18 1.6 1.33 1.84 ...
## $ CTX_LH_PRECENTRAL_VOLUME : num 10643 13116 13763 9665 12526 ...
## $ CTX_LH_PRECUNEUS_SUVR : num 1.909 0.468 0.583 1.627 3.145 ...
## $ CTX_LH_PRECUNEUS_VOLUME : num 8184 7041 7739 9173 8416 ...
## $ CTX_LH_ROSTRALANTERIORCINGULATE_SUVR : num 1.58 1.7 1.81 1.35 2.19 ...
## $ CTX_LH_ROSTRALANTERIORCINGULATE_VOLUME: num 2737 2240 2087 2341 2199 ...
## $ CTX_LH_SUPERIORFRONTAL_SUVR : num 1.448 1.731 1.1 2.507 0.999 ...
## [list output truncated]
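Given the naming pattern visible in the `str()` output, where each ROI contributes one `_SUVR` and one `_VOLUME` column, the columns can be separated by suffix, and repeated scans can be found by counting rows per subject ID. The following is a minimal sketch on a toy data frame; the values are made up and only the column names follow the pattern above.

```r
# Toy data frame mimicking the column naming scheme (values are made up)
toy <- data.frame(
  RID = c(21, 31, 31, 56),
  VISCODE = c("init", "init", "y1", "init"),
  BRAAK12_SUVR = c(2.09, 1.46, 1.54, 1.93),
  BRAAK12_VOLUME = c(10119, 8682, 13232, 10611),
  LEFT_AMYGDALA_SUVR = c(0.923, 2.092, 1.410, 0.789),
  LEFT_AMYGDALA_VOLUME = c(987, 1284, 841, 1708)
)

# Separate uptake columns from volume columns by suffix
suvr_cols   <- grep("_SUVR$", names(toy), value = TRUE)
volume_cols <- grep("_VOLUME$", names(toy), value = TRUE)

# Recover ROI names by stripping the suffix
roi_names <- sub("_SUVR$", "", suvr_cols)
print(roi_names)

# Count scans per subject to see who has repeated (longitudinal) scans
scans_per_subject <- table(toy$RID)
print(scans_per_subject)
repeat_subjects <- names(scans_per_subject)[scans_per_subject > 1]
print(repeat_subjects)
```

The same two `grep()` calls could be pointed at `names(tau.df)` to check how many SUVR and volume columns the loaded file actually contains.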
The SUVR value is normalized to the tau-PET uptake in the inferior cerebellum gray matter (highlighted in blue below), a commonly used reference region for tau normalization given the lack of inferior cerebellar tau pathology in Alzheimer’s Disease.
aseg_3d %>%
  unnest(ggseg_3d) %>%
  ungroup() %>%
  select(region) %>%
  na.omit() %>%
  mutate(val=ifelse(region %in% c("Right-Cerebellum-Cortex", "Left-Cerebellum-Cortex"), 1, 0)) %>%
  ggseg3d(atlas=aseg_3d, label="region", text="val", colour="val", na.alpha=0.5,
          palette=c("transparent", "deepskyblue3"), show.legend=FALSE) %>%
  add_glassbrain() %>%
  pan_camera("left lateral") %>%
  remove_axes()
## Warning: `arrange_()` is deprecated as of dplyr 0.7.0.
## Please use `arrange()` instead.
## See vignette('programming') for more help
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
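For readers unfamiliar with the term, the normalization behind an SUVR can be sketched in a few lines of base R: uptake in each region is divided by uptake in the reference region. The raw uptake numbers and region labels below are hypothetical and in arbitrary units; this illustrates the general definition and is not the processing pipeline used to produce the ADNI file.

```r
# Hypothetical raw tracer uptake values for one scan (arbitrary units)
raw_uptake <- c(
  inferior_cerebellar_gm = 0.80,  # reference region
  entorhinal = 1.20,
  amygdala = 1.00,
  precuneus = 0.96
)

# SUVR: regional uptake divided by uptake in the reference region
suvr <- raw_uptake / raw_uptake["inferior_cerebellar_gm"]
print(round(suvr, 2))
```

By construction the reference region has an SUVR of 1, so values above 1 indicate more tracer retention than in the inferior cerebellar gray matter.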